Data-Driven Insights for Artificial Flower E-Commerce: A Comprehensive Analysis with R and Python

Author

Srivatsava Chinnasamy Kamaraj

1 Introduction

1.1 Business Problem and Relevance

The global e-commerce market has experienced exponential growth, with online platforms enabling businesses to connect with consumers worldwide. The artificial flower industry is no exception, with suppliers competing on factors like pricing, order quantity, supplier experience, and verification status. However, without data-driven insights, businesses struggle to optimize supplier selection and pricing strategies, leading to missed opportunities for growth.

This study focuses on analyzing supplier performance, pricing trends, and customer engagement factors using a dataset from Kaggle (Artificial Flower E-Commerce Dataset). By leveraging statistical analysis and visualization techniques, I aim to uncover key business insights that can help suppliers and e-commerce platforms improve decision-making.


1.2 Objectives

This project aims to address the following key business questions:

  1. Supplier Performance: Which suppliers receive the highest order quantities, and what factors contribute to their success?
  2. Pricing Strategies: How do suppliers price their products, and does price impact order quantity?
  3. Verification Status Impact: Do verified suppliers receive higher order volumes than non-verified ones?
  4. Country-Wise Supplier Dominance: Which countries have the most high-volume suppliers?
  5. Supplier Experience & Growth: Does a supplier’s years in business affect order quantity?
  6. Statistical Validation: How do pricing, supplier rating, and review count correlate with sales?

By answering these questions, this project will provide actionable insights for suppliers, buyers, and e-commerce platforms.


1.3 Methodology

To ensure a structured, reproducible, and data-driven approach, I followed these steps:

  1. Data Acquisition & Preparation:
    • Retrieved data from Kaggle.
    • Performed data cleaning, missing value imputation, and feature extraction.
    • Standardized supplier names, order quantities, and pricing for consistency.
  2. Exploratory Data Analysis (EDA):
    • Used descriptive statistics and visualizations to identify patterns in supplier performance, order distribution, and pricing.
    • Examined supplier verification status, review counts, and supplier country.
  3. Business Analytics (Statistical Methods):
    • Correlation Analysis: Identified relationships between price, rating, review count, and order quantity.
    • T-Test: Checked if verified suppliers receive more orders than non-verified ones.
    • ANOVA: Analyzed differences in order quantity based on supplier experience level.
  4. Integration with Python (Quarto):
    • Implemented cross-language analysis using Python for data transformation and visualization.
  5. Database Storage (SQLite):
    • Stored the cleaned dataset in an SQLite database for efficient query-based analysis.
  6. Business Insights & Recommendations:
    • Derived insights on pricing strategies, supplier trust factors, and customer behavior.
    • Formulated strategic recommendations for suppliers to optimize their pricing and sales approach.

1.4 Expected Outcomes

This study will produce:  A comprehensive data-driven report highlighting supplier performance and pricing strategies.
Statistical validation of key business trends using T-Tests, ANOVA, and correlation analysis.
Actionable recommendations for suppliers and e-commerce platforms to improve sales and pricing strategies.
A structured, reproducible analytical pipeline that integrates R and Python for business analytics.

By applying data analytics techniques, this study contributes to a better understanding of supplier dynamics in the artificial flower e-commerce industry, enabling businesses to make informed decisions.


2 Packages

2.1 Load Required Libraries

# Data Manipulation and Processing
library(tidyverse)   
library(dplyr)       
library(stringr)     

# Data Visualization
library(ggplot2)     
library(plotly)      
library(gt)          

# Statistical Analysis
library(car)         
library(reshape2)    

# Database Management
library(DBI)         
library(RSQLite)     

# Python Integration with R
library(reticulate)  

2.2 Citations

[Wickham et al. (2019)](Wickham et al. 2023)[Wickham (2023)](Wickham 2016)[Sievert (2020)](Iannone et al. 2024)[Fox and Weisberg (2019)](Wickham 2007)[, Wickham, and Müller (2024)](Müller et al. 2024)(Ushey, Allaire, and Tang 2025)

3 Data Preparation

###Load and Inspect the Dataset

# Load dataset
df <- read.csv("Artificial_Flower_Ecommerce_Data.csv", stringsAsFactors = FALSE)

# Convert empty strings to NA
df[df == ""] <- NA

# Check dataset structure
str(df)
'data.frame':   4803 obs. of  8 variables:
 $ products_name      : chr  "Decorative LED Tree Flower Lights cherry blossom trees" "Cheap artificial flowers New Garden cherry blossom" "ADULT DIY MINI CONCRETE FLOWER POT FOR LIVINGROOM GARDEN" "Real Touch Calla Lily Decoration Flower Bridal Bouquet Flower Arrangements Flore Decorations 9pcs/bundle" ...
 $ prices             : chr  "US$99.00 - US$399.00" "US$0.80 - US$1.15" "US$2.15 - US$3.15" "US$2.56 - US$3.06" ...
 $ delivery_info      : chr  "Shipping to be negotiatedMin. order: 10 pieces" "Shipping to be negotiatedMin. order: 2.0 piecesEasy Return" "Shipping to be negotiatedMin. order: 500 boxesEasy Return" "Shipping to be negotiatedMin. order: 144 piecesEasy Return" ...
 $ order_quantity     : chr  NA "12 orders" NA "1 order" ...
 $ supplier_name      : chr  "Zhongshan Sun Neon Lighting Factory" "Tianjin Wuqing Meiyu Craft Flower Co., Ltd." "Ningbo SW Co., Ltd." "Shenzhen Oscare Trade Co., Ltd." ...
 $ verification_status: chr  "Verified" "Verified" "Verified" "Verified" ...
 $ years_in_business  : chr  "13 yrs" "2 yrs" "5 yrs" "5 yrs" ...
 $ supplier_info      : chr  "13 yrsCN 5.0/5.0 (1)" "2 yrsCN 4.8/5.0 (82)" "5 yrsCN 5.0/5.0 (2)" "5 yrsCN 4.8/5.0 (5)" ...
glimpse(df)
Rows: 4,803
Columns: 8
$ products_name       <chr> "Decorative LED Tree Flower Lights cherry blossom …
$ prices              <chr> "US$99.00 - US$399.00", "US$0.80 - US$1.15", "US$2…
$ delivery_info       <chr> "Shipping to be negotiatedMin. order: 10 pieces", …
$ order_quantity      <chr> NA, "12 orders", NA, "1 order", NA, "1,782 sold ",…
$ supplier_name       <chr> "Zhongshan Sun Neon Lighting Factory", "Tianjin Wu…
$ verification_status <chr> "Verified", "Verified", "Verified", "Verified", NA…
$ years_in_business   <chr> "13 yrs", "2 yrs", "5 yrs", "5 yrs", "1 yr", "6 yr…
$ supplier_info       <chr> "13 yrsCN 5.0/5.0 (1)", "2 yrsCN 4.8/5.0 (82)", "5…
# Summary statistics
summary(df)
 products_name         prices          delivery_info      order_quantity    
 Length:4803        Length:4803        Length:4803        Length:4803       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 supplier_name      verification_status years_in_business  supplier_info     
 Length:4803        Length:4803         Length:4803        Length:4803       
 Class :character   Class :character    Class :character   Class :character  
 Mode  :character   Mode  :character    Mode  :character   Mode  :character  
# Check for missing values
if (nrow(df) > 0) {
  missing_values <- colSums(is.na(df))
  print(missing_values)
  
  missing_percentage <- missing_values / nrow(df) * 100
  print(missing_percentage)
} else {
  message("Dataset is empty. Cannot calculate missing values.")
}
      products_name              prices       delivery_info      order_quantity 
                  0                   0                   0                2461 
      supplier_name verification_status   years_in_business       supplier_info 
                  0                3304                   0                   0 
      products_name              prices       delivery_info      order_quantity 
            0.00000             0.00000             0.00000            51.23881 
      supplier_name verification_status   years_in_business       supplier_info 
            0.00000            68.79034             0.00000             0.00000 

3.1 Understanding the Data

The dataset consists of 4,803 rows and 8 columns, representing various aspects of artificial flower e-commerce data. Most columns are text-based and require conversion for proper analysis.

3.2 Missing Values Overview

  • order_quantity has 2,461 missing values (51.3%) requiring imputation.
  • verification_status has 3,304 missing values (68.8%) and will be replaced with “Not Verified”.

3.3 Categorical vs. Numerical Data

  • Categorical Columns: products_name, supplier_name, verification_status, supplier_info, delivery_info.
  • Columns Requiring Numeric Conversion: prices (split into min_price & max_price), order_quantity, years_in_business, supplier_rating.

3.4 Handling Missing Values

# Function to handle missing verification_status
handle_verification_status <- function(data) {
  data |> 
    mutate(verification_status = replace_na(verification_status, "Not Verified"))
}

# Function to handle missing order_quantity using median or mean
handle_order_quantity <- function(data) {
  data$order_quantity <- as.numeric(gsub("[^0-9]", "", data$order_quantity))
  order_quantity_no_na <- na.omit(data$order_quantity)
  
  order_quantity_skewness <- shapiro.test(order_quantity_no_na)
  
  imputed_value <- ifelse(order_quantity_skewness$p.value >= 0.05, 
                          mean(order_quantity_no_na, na.rm = TRUE), 
                          median(order_quantity_no_na, na.rm = TRUE))
  
  data |> 
    mutate(order_quantity = ifelse(is.na(order_quantity), imputed_value, order_quantity))
}

# Apply the functions
df <- df |> 
  handle_verification_status() |> 
  handle_order_quantity()

# Check missing values handled
df |> summarise(across(everything(), ~sum(is.na(.))))
  products_name prices delivery_info order_quantity supplier_name
1             0      0             0              0             0
  verification_status years_in_business supplier_info
1                   0                 0             0
df |> 
  select(verification_status, order_quantity) |>  
  head(10)
   verification_status order_quantity
1             Verified              4
2             Verified             12
3             Verified              4
4             Verified              1
5         Not Verified              4
6             Verified           1782
7             Verified              4
8             Verified           5178
9             Verified             15
10            Verified           1520

3.4.1 Explanation of Handling Missing Values

  • verification_status: Replaced missing values with “Not Verified”.
  • order_quantity: Extracted numeric values, checked normality using Shapiro-Wilk test, and imputed with mean (if normal) or median (if skewed).

This approach ensures clean and structured data, ready for further analysis.

3.5 Extracting and Converting Prices

extract_prices <- function(data) {
  tryCatch({
    # Fix price extraction by splitting values and handling numeric conversion
df <- df |> mutate(
  prices = gsub("US\\$", "", prices),  # Remove 'US$'
  prices = gsub(",", "", prices),  # Remove commas
  min_price = as.numeric(str_extract(prices, "^[0-9]+\\.?[0-9]*")),  # Extract first price
  max_price = as.numeric(str_extract(prices, "(?<=- )[0-9]+\\.?[0-9]*")),  # Extract second price
  max_price = ifelse(is.na(max_price), min_price, max_price)  # If max is missing, set to min
)

      
  }, error = function(e) {
    message("Error in extract_prices(): ", e$message)
    return(data)
  })
}

# Apply function
suppressWarnings(suppressMessages({
  df <- extract_prices(df)
}))

# Check for missing values
df |> summarise(min_price_na = sum(is.na(min_price)), max_price_na = sum(is.na(max_price)))
  min_price_na max_price_na
1            0            0
df |> select(prices, min_price, max_price) |> head(10)
           prices min_price max_price
1  99.00 - 399.00     99.00    399.00
2     0.80 - 1.15      0.80      1.15
3     2.15 - 3.15      2.15      3.15
4     2.56 - 3.06      2.56      3.06
5     5.20 - 6.00      5.20      6.00
6            2.30      2.30      2.30
7     5.88 - 6.86      5.88      6.86
8     0.80 - 0.89      0.80      0.89
9   33.30 - 75.00     33.30     75.00
10           8.48      8.48      8.48

3.5.1 Explanation for Extracting and Converting Prices

In the Extracting and Converting Prices section, we focused on cleaning and extracting relevant numerical information from the prices column, which contains values in the format “US\(min_price - US\)max_price”. The process involves:

  1. Removing the ‘US$’ symbol from the prices column to facilitate numeric conversion.
  2. Extracting the minimum price from the string using regular expressions (str_extract).
  3. Extracting the maximum price from the string, and if missing, assigning the value of the minimum price to ensure consistency in the data.

This ensures that we have valid numerical values for both min_price and max_price, making the dataset suitable for analysis, such as price comparison and trend analysis.

3.6 Converting Years in Business

convert_years_in_business <- function(data) {
  data |> 
    mutate(years_in_business = as.numeric(str_extract(years_in_business, "[0-9]+")))
}

# Apply function
df <- convert_years_in_business(df)

# Check for missing values
df |> summarise(years_in_business_na = sum(is.na(years_in_business)))
  years_in_business_na
1                    0
df |> select(years_in_business) |> head(10)
   years_in_business
1                 13
2                  2
3                  5
4                  5
5                  1
6                  6
7                  1
8                 14
9                  6
10                 3

3.6.1 Explanation for Converting Years in Business

In the Converting Years in Business section, we focused on extracting the numeric values from the years_in_business column, which contains text like “5 yrs” or “2 yrs”. The process involves:

  1. Extracting the numeric value (e.g., 5 from "5 yrs") using a regular expression (str_extract).
  2. Converting the extracted values into a numeric format to prepare the data for analysis.

3.7 Standardize supplier name

standardize_supplier_name <- function(data) {
  data |> mutate(supplier_name = str_trim(tolower(supplier_name)))
}

# Apply function
df <- standardize_supplier_name(df)

3.7.1 Explanation Standardizing Supplier Name

In the Standardizing Supplier Name section, we ensure uniformity in supplier names to avoid duplicates and inconsistencies in data analysis. Supplier names may have capitalization differences, extra spaces, or slight spelling variations, which can affect grouping and aggregation.

The process involves: 1. Converting all supplier names to lowercase to standardize text formatting. 2. Trimming extra spaces from names using str_trim(), ensuring consistency.

3.8 Extracting minimum order quantity

extract_min_order_quantity <- function(data) {
  data |> mutate(min_order_quantity = as.numeric(str_extract(delivery_info, "(?<=Min\\. order: )[0-9]+")))
}

# Apply function
df <- extract_min_order_quantity(df)

3.8.1 Explanation

In the Extracting Minimum Order Quantity section, we focus on retrieving the minimum order quantity from the delivery_info column. This column contains text data such as:

Since the minimum order quantity is embedded in text, we extract it using regular expressions.

The process involves: 1. Identifying and extracting the numeric value that appears after "Min. order:" using str_extract(). 2. Converting the extracted value into a numeric format for further analysis.

3.9 Extracting Supplier Rating, Country & Review Count

extract_supplier_details <- function(data) {
  data |> mutate(
    supplier_rating = as.numeric(str_extract(supplier_info, "[0-9]\\.[0-9](?=/5\\.0)")),  # Extract rating (e.g., 4.8 from "4.8/5.0")
    review_count = as.numeric(str_extract(supplier_info, "(?<=\\()[0-9]+(?=\\))")),  # Extract review count (e.g., 82 from "(82)")
    supplier_country = str_extract(supplier_info, "(?<=yrs?)[A-Z]+")  # Extract country code (e.g., "CN" from "5 yrsCN")
  )
}

# Apply function
df <- extract_supplier_details(df)

3.9.1 Explanation

The supplier_info column contains multiple pieces of information, including supplier rating, review count, and country. Extracting these details allows for structured analysis.

Key Actions: - Supplier Rating: Extracts numeric rating (e.g., 4.8 from "4.8/5.0") for performance analysis. - Review Count: Retrieves the number of customer reviews (e.g., 82 from "(82)") to measure supplier credibility. - Supplier Country: Extracts country code (e.g., "CN" from "5 yrsCN") for regional comparisons.

3.10 Verify Transformations

df |> select(supplier_name, min_order_quantity, supplier_rating, review_count,supplier_country) |> head(10)
                                     supplier_name min_order_quantity
1              zhongshan sun neon lighting factory                 10
2      tianjin wuqing meiyu craft flower co., ltd.                  2
3                              ningbo sw co., ltd.                500
4                  shenzhen oscare trade co., ltd.                144
5  beijing jiaxunda technology and trade co., ltd.                 10
6             qingdao golden handicrafts co., ltd.                100
7                     l&c home (zhongshan) limited                 10
8       qingdao castle industry and trade co., ltd                100
9       qingdao ouli international trade co., ltd.                  5
10                   qingdao royal craft co., ltd.                 10
   supplier_rating review_count supplier_country
1              5.0            1               CN
2              4.8           82               CN
3              5.0            2               CN
4              4.8            5               CN
5              5.0            2               CN
6              4.7          129               CN
7              5.0            1               CN
8              4.7          188               CN
9              4.7           70               CN
10             4.8          144               CN
# Checking for missing values across all columns
df |> summarise(across(everything(), ~sum(is.na(.))))
  products_name prices delivery_info order_quantity supplier_name
1             0      0             0              0             0
  verification_status years_in_business supplier_info min_price max_price
1                   0                 0             0         0         0
  min_order_quantity supplier_rating review_count supplier_country
1                  0             847          847                0

we have 847 NA values in supplier_rating and review_count Extracted numeric values, checked normality using Shapiro-Wilk test, and imputed with mean (if normal) or median (if skewed).

# Function to handle missing supplier_rating and review_count using Shapiro-Wilk test
handle_supplier_details <- function(data) {
  # Removing NAs before performing normality test
  supplier_rating_no_na <- na.omit(data$supplier_rating)
  review_count_no_na <- na.omit(data$review_count)
  
  # Perform Shapiro-Wilk test for normality
  supplier_rating_shapiro <- shapiro.test(supplier_rating_no_na)
  review_count_shapiro <- shapiro.test(review_count_no_na)
  
  # Determine imputation value based on normality test
  rating_imputed <- ifelse(supplier_rating_shapiro$p.value >= 0.05, 
                           mean(supplier_rating_no_na, na.rm = TRUE), 
                           median(supplier_rating_no_na, na.rm = TRUE))
  
  review_imputed <- ifelse(review_count_shapiro$p.value >= 0.05, 
                            mean(review_count_no_na, na.rm = TRUE), 
                            median(review_count_no_na, na.rm = TRUE))
  
  # Impute missing values
  data |> 
    mutate(
      supplier_rating = ifelse(is.na(supplier_rating), rating_imputed, supplier_rating),
      review_count = ifelse(is.na(review_count), review_imputed, review_count)
    )
}

# Apply the function
df <- handle_supplier_details(df)

# Check missing values after imputation
df |> summarise(
  supplier_rating_na = sum(is.na(supplier_rating)),
  review_count_na = sum(is.na(review_count))
)
  supplier_rating_na review_count_na
1                  0               0
# Verify the updated dataset
# Checking for missing values across all columns
df |> summarise(across(everything(), ~sum(is.na(.))))
  products_name prices delivery_info order_quantity supplier_name
1             0      0             0              0             0
  verification_status years_in_business supplier_info min_price max_price
1                   0                 0             0         0         0
  min_order_quantity supplier_rating review_count supplier_country
1                  0               0            0                0
# Convert verification_status into a factor
df <- df |> mutate(verification_status = as.factor(verification_status))

str(df$verification_status)  # Check structure of verification_status
 Factor w/ 2 levels "Not Verified",..: 2 2 2 2 1 2 2 2 2 2 ...
nrow(df)  # Check total rows before removing duplicates
[1] 4803
df <- df |> distinct()  # Remove duplicate rows
nrow(df)  # Check total rows after removing duplicates
[1] 4487
# Remove unwanted columns
df <- df |> select(-prices, -delivery_info, -supplier_info)

# Confirm the updated dataset structure
str(df)
'data.frame':   4487 obs. of  11 variables:
 $ products_name      : chr  "Decorative LED Tree Flower Lights cherry blossom trees" "Cheap artificial flowers New Garden cherry blossom" "ADULT DIY MINI CONCRETE FLOWER POT FOR LIVINGROOM GARDEN" "Real Touch Calla Lily Decoration Flower Bridal Bouquet Flower Arrangements Flore Decorations 9pcs/bundle" ...
 $ order_quantity     : num  4 12 4 1 4 ...
 $ supplier_name      : chr  "zhongshan sun neon lighting factory" "tianjin wuqing meiyu craft flower co., ltd." "ningbo sw co., ltd." "shenzhen oscare trade co., ltd." ...
 $ verification_status: Factor w/ 2 levels "Not Verified",..: 2 2 2 2 1 2 2 2 2 2 ...
 $ years_in_business  : num  13 2 5 5 1 6 1 14 6 3 ...
 $ min_price          : num  99 0.8 2.15 2.56 5.2 2.3 5.88 0.8 33.3 8.48 ...
 $ max_price          : num  399 1.15 3.15 3.06 6 2.3 6.86 0.89 75 8.48 ...
 $ min_order_quantity : num  10 2 500 144 10 100 10 100 5 10 ...
 $ supplier_rating    : num  5 4.8 5 4.8 5 4.7 5 4.7 4.7 4.8 ...
 $ review_count       : num  1 82 2 5 2 129 1 188 70 144 ...
 $ supplier_country   : chr  "CN" "CN" "CN" "CN" ...
summary(df)
 products_name      order_quantity    supplier_name        verification_status
 Length:4487        Min.   :    1.0   Length:4487        Not Verified:3082    
 Class :character   1st Qu.:    4.0   Class :character   Verified    :1405    
 Mode  :character   Median :    4.0   Mode  :character                        
                    Mean   :  102.9                                           
                    3rd Qu.:    4.0                                           
                    Max.   :22898.0                                           
 years_in_business   min_price         max_price       min_order_quantity
 Min.   : 1.000    Min.   :   0.01   Min.   :   0.02   Min.   :    1.0   
 1st Qu.: 1.000    1st Qu.:   0.56   1st Qu.:   0.75   1st Qu.:    2.0   
 Median : 3.000    Median :   1.43   Median :   1.82   Median :   20.0   
 Mean   : 3.893    Mean   :  20.46   Mean   :  37.11   Mean   :  149.8   
 3rd Qu.: 5.000    3rd Qu.:   5.50   3rd Qu.:   8.00   3rd Qu.:  100.0   
 Max.   :21.000    Max.   :1600.00   Max.   :4500.00   Max.   :80000.0   
 supplier_rating  review_count    supplier_country  
 Min.   :2.300   Min.   :  1.00   Length:4487       
 1st Qu.:4.700   1st Qu.:  5.00   Class :character  
 Median :4.800   Median : 15.00   Mode  :character  
 Mean   :4.762   Mean   : 25.55                     
 3rd Qu.:4.900   3rd Qu.: 28.00                     
 Max.   :5.000   Max.   :505.00                     

4 More Feature Engineering

4.1 Create Price Features

# Load the cleaned dataset
df <- read.csv("Cleaned_Artificial_Flower_Data.csv")

# Create price spread and average price
df <- df |>
  mutate(
    price_spread = max_price - min_price,
    avg_price = (min_price + max_price) / 2
  )

Insights: - price_spread helps understand how much suppliers vary their prices. - avg_price simplifies analysis by providing a single metric instead of two separate columns.


4.2 Categorize Supplier Experience

# Categorizing years in business
df <- df |>
  mutate(
    experience_category = case_when(
      years_in_business < 3 ~ "New",
      years_in_business >= 3 & years_in_business <= 7 ~ "Growing",
      years_in_business > 7 ~ "Established"
    )
  )

Insights: - Instead of treating years_in_business as a continuous variable, grouping it into categories helps in comparisons. - This is useful for ANOVA and understanding the impact of supplier experience on sales.


4.3 Log Transform Order Quantity

# Apply log transformation to order quantity
df <- df |>
  mutate(log_order_quantity = log1p(order_quantity))

Insights: - order_quantity is highly skewed; some products have very high order counts. - Log transformation normalizes the distribution, making statistical tests and regression models more reliable.


# display df
str(df)
'data.frame':   4487 obs. of  15 variables:
 $ products_name      : chr  "Decorative LED Tree Flower Lights cherry blossom trees" "Cheap artificial flowers New Garden cherry blossom" "ADULT DIY MINI CONCRETE FLOWER POT FOR LIVINGROOM GARDEN" "Real Touch Calla Lily Decoration Flower Bridal Bouquet Flower Arrangements Flore Decorations 9pcs/bundle" ...
 $ order_quantity     : int  4 12 4 1 4 1782 4 5178 15 1520 ...
 $ supplier_name      : chr  "zhongshan sun neon lighting factory" "tianjin wuqing meiyu craft flower co., ltd." "ningbo sw co., ltd." "shenzhen oscare trade co., ltd." ...
 $ verification_status: chr  "Verified" "Verified" "Verified" "Verified" ...
 $ years_in_business  : int  13 2 5 5 1 6 1 14 6 3 ...
 $ min_price          : num  99 0.8 2.15 2.56 5.2 2.3 5.88 0.8 33.3 8.48 ...
 $ max_price          : num  399 1.15 3.15 3.06 6 2.3 6.86 0.89 75 8.48 ...
 $ min_order_quantity : int  10 2 500 144 10 100 10 100 5 10 ...
 $ supplier_rating    : num  5 4.8 5 4.8 5 4.7 5 4.7 4.7 4.8 ...
 $ review_count       : int  1 82 2 5 2 129 1 188 70 144 ...
 $ supplier_country   : chr  "CN" "CN" "CN" "CN" ...
 $ price_spread       : num  300 0.35 1 0.5 0.8 0 0.98 0.09 41.7 0 ...
 $ avg_price          : num  249 0.975 2.65 2.81 5.6 ...
 $ experience_category: chr  "Established" "New" "Growing" "Growing" ...
 $ log_order_quantity : num  1.609 2.565 1.609 0.693 1.609 ...
# Write the cleaned dataset to a CSV file
write.csv(df, "Cleaned_Artificial_Flower_Data1.csv", row.names = FALSE)

5 Exploratory Data Analysis

5.1 Exploring Data Distribution with Boxplots

To understand the spread of key numerical variables and detect potential outliers, we use boxplots for: - Order Quantity - Min Price - Max Price - Order Quantity by Supplier Rating

5.1.1 Boxplots with Tabset for Interactive Viewing

ggplot(df, aes(y = order_quantity)) +
  geom_boxplot(fill = "lightblue", outlier.color = "red") +
  labs(title = "Boxplot of Order Quantity", y = "Order Quantity") +
  theme_minimal()

ggplot(df, aes(x = as.factor(supplier_rating), y = order_quantity)) +
  geom_boxplot(fill = "purple", outlier.color = "red") +
  labs(title = "Boxplot of Order Quantity by Supplier Rating", x = "Supplier Rating", y = "Order Quantity") +
  theme_minimal()

ggplot(df, aes(y = min_price)) +
  geom_boxplot(fill = "lightgreen", outlier.color = "red") +
  labs(title = "Boxplot of Min Price", y = "Min Price") +
  theme_minimal()

ggplot(df, aes(y = max_price)) +
  geom_boxplot(fill = "orange", outlier.color = "red") +
  labs(title = "Boxplot of Max Price", y = "Max Price") +
  theme_minimal()

Insights from Boxplots

  • Order Quantity:
    • Extreme outliers in order quantity indicate that some suppliers receive very high order volumes.
    • The median order quantity is relatively low, meaning most suppliers receive small to moderate orders.
    • The log transformation applied earlier helps normalize this skewed distribution.
  • Min Price:
    • Some suppliers list very low prices, possibly as promotional or bulk discounts.
    • A wide variation in min prices suggests different pricing strategies among suppliers.
    • Presence of outliers indicates extreme low-cost products that may require further analysis.
  • Max Price:
    • Significant variation in max prices shows different pricing tiers (retail vs. bulk pricing).
    • Some suppliers set high max prices, likely for premium artificial flowers.
    • Outliers in max price suggest some suppliers inflate their max price significantly.
  • Order Quantity by Supplier Rating:
    • Higher supplier ratings do not necessarily guarantee higher order quantities.
    • Wide spread of order quantity within each rating level suggests that other factors influence sales volume.
    • Some low-rated suppliers still receive large orders, potentially due to competitive pricing or other business strategies.

5.2 Exploring Data Distribution with Histograms

To further analyze data distribution, we use histograms for: - Order Quantity - Min Price - Max Price - Log-Transformed Order Quantity

5.2.1 Histograms with Tabset for Interactive Viewing

ggplot(df, aes(x = order_quantity)) +
  geom_histogram(bins = 50, fill = "blue", alpha = 0.7) +
  labs(title = "Distribution of Order Quantity", x = "Order Quantity", y = "Count") +
  theme_minimal()

ggplot(df, aes(x = min_price)) +
  geom_histogram(bins = 50, fill = "lightgreen", alpha = 0.7) +
  labs(title = "Distribution of Min Price", x = "Min Price", y = "Count") +
  theme_minimal()

ggplot(df, aes(x = max_price)) +
  geom_histogram(bins = 50, fill = "orange", alpha = 0.7) +
  labs(title = "Distribution of Max Price", x = "Max Price", y = "Count") +
  theme_minimal()

ggplot(df, aes(x = log_order_quantity)) +
  geom_histogram(bins = 50, fill = "purple", alpha = 0.7) +
  labs(title = "Distribution of Log-Transformed Order Quantity", x = "Log(Order Quantity)", y = "Count") +
  theme_minimal()

Insights from Histograms

  • Order Quantity:
    • The raw order quantity is highly right-skewed, confirming the presence of large order outliers.
    • Log transformation helps normalize the distribution, making it better suited for statistical analysis.
  • Min Price:
    • Most suppliers list very low minimum prices, with a sharp drop-off as prices increase.
    • A significant number of suppliers set their minimum prices close to zero, aligning with boxplot insights that showed extreme low-price outliers.
  • Max Price:
    • Prices vary greatly, and a small number of suppliers have extremely high maximum prices.
    • The long right tail in the max price histogram suggests that some suppliers charge premium prices, confirming the variability seen in the boxplot.

6 Correlation Analysis

To understand relationships between numerical variables, we compute and visualize a correlation matrix. This helps identify key factors influencing order quantity and pricing behavior.

6.1 Interactive Correlation Heatmap

library(ggplot2)
library(reshape2)
library(plotly)

# Compute correlation matrix
cor_matrix <- cor(df |> select(where(is.numeric)), use = "complete.obs")
cor_long <- melt(cor_matrix)

# Create interactive heatmap
p <- ggplot(cor_long, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), color = "black", size = 3) +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab") +
  theme_minimal() +
  labs(title = "Correlation Heatmap of Numeric Features")

# Convert ggplot to interactive plotly chart
ggplotly(p)

Insights from Correlation Analysis

  • Review Count has the strongest correlation with Order Quantity (~0.15) → Suggesting customer engagement drives sales more than price or rating.
  • Years in Business has a weak correlation with Order Quantity (~0.067) but a stronger correlation with Review Count (~0.32), meaning older suppliers attract more reviews but not necessarily higher sales.
  • Price (Min/Max) shows minimal correlation with Order Quantity → Suggesting that pricing alone does not dictate sales volume.
  • Supplier Rating has little impact on Order Quantity → Confirming that higher-rated suppliers do not necessarily receive more orders.

7 Data Analysis and Insights

We now analyze key business questions using visualizations and summary statistics.

7.1 1️⃣ What is the distribution of order quantities across different suppliers?

library(ggplot2)

# Visualizing Order Quantity Distribution by Supplier
ggplot(df, aes(x = supplier_name, y = order_quantity)) +
  geom_boxplot(fill = "lightblue", outlier.color = "red") +
  labs(title = "Order Quantity Distribution Across Suppliers",
       x = "Supplier",
       y = "Order Quantity") +
  theme(axis.text.x = element_blank())

Insights: - Most suppliers receive low-to-moderate order quantities. - Some suppliers have significantly higher order volumes, indicating potential top-performing sellers. - Outliers suggest extreme sales cases, which may be worth investigating separately.


7.2 2️⃣ Do verified suppliers receive more orders compared to non-verified suppliers?

# Boxplot: Order Quantity by Verification Status
ggplot(df, aes(x = verification_status, y = order_quantity, fill = verification_status)) +
  geom_boxplot(outlier.color = "red") +
  labs(title = "Order Quantity by Supplier Verification Status",
       x = "Verification Status",
       y = "Order Quantity") +
  theme_minimal()

Insights: - No significant difference in order quantity between verified & non-verified suppliers. - Both verified and non-verified suppliers have outliers, suggesting that verification is not a strong indicator of sales volume. - This aligns with the correlation analysis, where verification status showed a weak correlation with order quantity.


7.3 3️⃣ Which country has the highest average order quantity?

library(gt)

# Average Order Quantity by Country
df |> 
  group_by(supplier_country) |> 
  summarise(avg_order_quantity = mean(order_quantity)) |> 
  arrange(desc(avg_order_quantity)) |> 
  head(10) |> 
  gt() |> 
  tab_header(title = "Top 10 Countries by Average Order Quantity")
Top 10 Countries by Average Order Quantity
supplier_country avg_order_quantity
CN 105.1077
HK 4.0000
IN 4.0000
JP 4.0000
MY 4.0000
TH 4.0000
TR 4.0000
TW 4.0000
VN 4.0000

Insights: - China (CN) has the highest average order quantity, significantly surpassing all other countries. - Most other countries have uniformly low average order quantities, suggesting smaller-scale operations. - The large gap between China and other countries indicates a strong bulk order trend from Chinese suppliers.


7.4 4️⃣ Do suppliers with higher ratings tend to have lower or higher prices?

# Scatter Plot: Supplier Rating vs. Average Price
ggplot(df, aes(x = supplier_rating, y = avg_price)) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Supplier Rating vs. Average Price",
       x = "Supplier Rating",
       y = "Average Price") +
  theme_minimal()

Insights: - No strong relationship between supplier rating and price. - Higher-rated suppliers do not necessarily charge higher prices. - Some low-rated suppliers have high prices, which may indicate brand reputation or premium products.


7.5 5️⃣ Which suppliers have the highest price variations (difference between min and max price)?

library(gt)

# Calculating average price spread for each supplier
df |> 
  group_by(supplier_name) |> 
  summarise(avg_price_spread = mean(price_spread, na.rm = TRUE)) |> 
  arrange(desc(avg_price_spread)) |> 
  head(10) |> 
  gt() |> 
  tab_header(title = "Top 10 Suppliers by Average Price Variation")
Top 10 Suppliers by Average Price Variation
supplier_name avg_price_spread
guangdong songtao co., ltd. 2281.4250
green lantern optoelectronic light factory 659.1000
guangzhou meihua arts and crafts co., ltd. 500.0000
d.s.t. exports 495.0000
qingdao shengyabo international trade co., ltd. 398.4127
shenzhen yirong industrial co., ltd. 395.6000
zhongshan sun neon lighting factory 300.0000
qingdao meishihua international trade co., ltd. 284.2028
qingdao eagle international trading company limited 254.3029
zhengzhou einek handicrafts co., ltd. 220.7143

Insights: - Guangdong Songtao Co., Ltd. has the highest average price spread, suggesting significant pricing flexibility. - A steep drop in price variation after the top supplier indicates that few suppliers operate with highly dynamic pricing. - Suppliers with higher average price variation may offer better negotiation opportunities for bulk purchases. - The presence of suppliers with moderate pricing variability suggests a mix of fixed and dynamic pricing strategies in the market.

8 Statistical Test

8.1 T-Test: Is there a significant difference in average order quantity between verified and non-verified suppliers?

  • T-test is used when comparing two groups (verified vs. non-verified suppliers).
  • We’ll check if there is a significant difference in order quantity between these two groups.
  • We’ll first check normality (Shapiro-Wilk test) and variance equality (Levene’s test) before running the independent t-test.
# Check normality of order quantity for both groups
shapiro.test(df$log_order_quantity[df$verification_status == "Verified"])

    Shapiro-Wilk normality test

data:  df$log_order_quantity[df$verification_status == "Verified"]
W = 0.74147, p-value < 2.2e-16
shapiro.test(df$log_order_quantity[df$verification_status == "Not Verified"])

    Shapiro-Wilk normality test

data:  df$log_order_quantity[df$verification_status == "Not Verified"]
W = 0.58974, p-value < 2.2e-16
# Check variance equality using Levene's test
leveneTest(log_order_quantity ~ verification_status, data = df)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)    
group    1  70.101 < 2.2e-16 ***
      4485                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Perform independent T-test (Welch's T-test for unequal variances)
t_test_result <- t.test(log_order_quantity ~ verification_status, data = df, var.equal = FALSE)
print(t_test_result)

    Welch Two Sample t-test

data:  log_order_quantity by verification_status
t = -5.2889, df = 2303.7, p-value = 1.347e-07
alternative hypothesis: true difference in means between group Not Verified and group Verified is not equal to 0
95 percent confidence interval:
 -0.3581331 -0.1643923
sample estimates:
mean in group Not Verified     mean in group Verified 
                  1.903275                   2.164538 

Insights from T-Test Analysis

  • Normality Check (Shapiro-Wilk Test): Both verified and non-verified suppliers’ log-transformed order quantities are not normally distributed (p < 0.05).
  • Variance Check (Levene’s Test): Variances are highly unequal (p < 2.2e-16), so Welch’s T-test is used.
  • T-Test Results:
    • p-value = 1.347e-07Statistically significant at the 0.05 level.
    • Mean Log Orders: Verified (2.1645) vs. Non-Verified (1.9033).
    • 95% Confidence Interval (-0.3581, -0.1644) does not include zero, confirming a significant difference.

Conclusion:

  • Verified suppliers receive significantly higher order quantities than non-verified suppliers (p < 0.05).
  • The difference is statistically significant, meaning verification has a real impact on sales.
  • Business Implication: Verification status is important for increasing orders. Non-verified suppliers might benefit from getting verified to boost credibility and sales.

8.2 ANOVA: Does the supplier experience level (New, Growing, Established) impact average order quantity?

  • ANOVA is used when comparing more than two groups.
  • Here, we check if order quantity differs across supplier experience levels (New, Growing, Established).
  • If ANOVA is significant, we perform Tukey’s HSD test for pairwise comparisons.
# Check normality for each experience level
shapiro.test(df$log_order_quantity[df$experience_category == "New"])

    Shapiro-Wilk normality test

data:  df$log_order_quantity[df$experience_category == "New"]
W = 0.52393, p-value < 2.2e-16
shapiro.test(df$log_order_quantity[df$experience_category == "Growing"])

    Shapiro-Wilk normality test

data:  df$log_order_quantity[df$experience_category == "Growing"]
W = 0.7296, p-value < 2.2e-16
shapiro.test(df$log_order_quantity[df$experience_category == "Established"])

    Shapiro-Wilk normality test

data:  df$log_order_quantity[df$experience_category == "Established"]
W = 0.69711, p-value < 2.2e-16
# Check variance equality using Levene's test
leveneTest(log_order_quantity ~ experience_category, data = df)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)    
group    2   109.9 < 2.2e-16 ***
      4484                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Perform ANOVA
anova_result <- aov(log_order_quantity ~ experience_category, data = df)
summary(anova_result)
                      Df Sum Sq Mean Sq F value Pr(>F)    
experience_category    2    320  159.93   80.69 <2e-16 ***
Residuals           4484   8887    1.98                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Perform Tukey HSD post-hoc test if ANOVA is significant
tukey_result <- TukeyHSD(anova_result)
print(tukey_result)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = log_order_quantity ~ experience_category, data = df)

$experience_category
                           diff        lwr        upr     p adj
Growing-Established  0.01001782 -0.1574220  0.1774577 0.9892115
New-Established     -0.53045055 -0.6987636 -0.3621375 0.0000000
New-Growing         -0.54046837 -0.6447752 -0.4361615 0.0000000

Insights from ANOVA Analysis

  • Normality Check (Shapiro-Wilk Test): Log-transformed order quantity is not normally distributed across all experience levels.
  • Variance Check (Levene’s Test): Variances are highly unequal (p < 2.2e-16), but ANOVA remains robust.
  • ANOVA Results:
    • p-value < 0.05 → Supplier experience significantly impacts order quantity.
  • Tukey’s HSD Test Results:
    • New vs. Established: p < 0.0001 (Significant) → Established suppliers receive significantly more orders than New suppliers.
    • New vs. Growing: p < 0.0001 (Significant) → Growing suppliers also receive more orders than New suppliers.
    • Growing vs. Established: p = 0.9892 (Not Significant) → No major difference between these two groups.

Conclusion:

  • Supplier experience significantly affects order quantity (ANOVA p < 0.05).
  • New suppliers receive significantly fewer orders than Growing and Established suppliers.
  • No significant difference between Growing and Established suppliers, meaning sales stabilize after a certain experience level.
  • Business Implication: New suppliers should focus on building credibility through better pricing, promotions, and customer engagement to compete with experienced suppliers.

8.3 ANOVA: Do suppliers from different countries have significantly different average prices?

  • ANOVA is used to check if average price varies significantly between suppliers from different countries.
  • If ANOVA is significant, we perform Tukey’s HSD test for pairwise comparisons.
# Compute avg_price if not already computed
df$avg_price <- (df$min_price + df$max_price) / 2

# Check normality of average price\shapiro.test(df$avg_price)

# Check variance equality using Levene's test
leveneTest(avg_price ~ supplier_country, data = df)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value Pr(>F)
group    8  0.3748 0.9344
      4478               
# Perform ANOVA
anova_country <- aov(avg_price ~ supplier_country, data = df)
summary(anova_country)
                   Df   Sum Sq Mean Sq F value Pr(>F)
supplier_country    8    26245    3281   0.343  0.949
Residuals        4478 42870861    9574               
# Perform Tukey HSD post-hoc test if ANOVA is significant
tukey_country <- TukeyHSD(anova_country)
print(tukey_country)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = avg_price ~ supplier_country, data = df)

$supplier_country
               diff        lwr       upr     p adj
HK-CN -27.611291110 -142.46735  87.24477 0.9981034
IN-CN  -6.375021294  -45.20851  32.45847 0.9998864
JP-CN -21.692719681 -325.36588 281.98045 0.9999998
MY-CN  41.337280319 -173.41652 256.09108 0.9996247
TH-CN -23.056469681 -130.50669  84.39375 0.9991644
TR-CN -11.187719681 -225.94152 203.56608 1.0000000
TW-CN -26.180411989 -110.51920  58.15838 0.9890155
VN-CN -26.177719681 -240.93152 188.57608 0.9999885
IN-HK  21.236269816  -99.83376 142.30630 0.9998127
JP-HK   5.918571429 -318.68473 330.52187 1.0000000
MY-HK  68.948571429 -174.50390 312.40105 0.9940547
TH-HK   4.554821429 -152.59308 161.70272 1.0000000
TR-HK  16.423571429 -227.02890 259.87605 0.9999999
TW-HK   1.430879121 -140.91708 143.77884 1.0000000
VN-HK   1.433571429 -242.01890 244.88605 1.0000000
JP-IN -15.317698387 -321.39519 290.75979 1.0000000
MY-IN  47.712301613 -170.42810 265.85270 0.9990395
TH-IN -16.681448387 -130.74979  97.38690 0.9999530
TR-IN  -4.812698387 -222.95310 213.32770 1.0000000
TW-IN -19.805390695 -112.42865  72.81787 0.9991855
VN-IN -19.802698387 -237.94310 198.33770 0.9999989
MY-JP  63.030000000 -308.84980 434.90980 0.9998554
TH-JP  -1.363750000 -323.42110 320.69360 1.0000000
TR-JP  10.505000000 -361.37480 382.38480 1.0000000
TW-JP  -4.487692308 -319.58834 310.61296 1.0000000
VN-JP  -4.485000000 -376.36480 367.39480 1.0000000
TH-MY -64.393750000 -304.44113 175.65363 0.9959029
TR-MY -52.525000000 -356.16358 251.11358 0.9998313
TW-MY -67.517692308 -298.14775 163.11237 0.9925610
VN-MY -67.515000000 -371.15358 236.12358 0.9989175
TR-TH  11.868750000 -228.17863 251.91613 1.0000000
TW-TH  -3.123942308 -139.56653 133.31864 1.0000000
VN-TH  -3.121250000 -243.16863 236.92613 1.0000000
TW-TR -14.992692308 -245.62275 215.63737 0.9999999
VN-TR -14.990000000 -318.62858 288.64858 1.0000000
VN-TW   0.002692308 -230.62737 230.63275 1.0000000
# Boxplot visualization of price differences across supplier countries
ggplot(df, aes(x = supplier_country, y = avg_price)) +
  geom_boxplot(fill = "lightblue", outlier.color = "red") +
  labs(title = "Average Price Distribution Across Supplier Countries",
       x = "Supplier Country",
       y = "Average Price") +
  theme_minimal()

Insights from ANOVA Analysis

  • No significant difference in average prices across supplier countries (p = 0.9495).
  • Levene’s test confirms equal variances, making ANOVA results reliable.
  • Despite price variations in some countries (outliers in the boxplot), overall pricing remains similar globally.
  • Business Implication: Country of origin does not strongly influence pricing—buyers should focus more on quality, supplier reputation, and order quantity rather than geographical pricing trends.

8.4 Pearson and Spearman Correlation test: Does review count correlate with order quantity?

library(ggplot2)

# Check normality of order quantity and review count
shapiro.test(df$log_order_quantity)

    Shapiro-Wilk normality test

data:  df$log_order_quantity
W = 0.6459, p-value < 2.2e-16
shapiro.test(df$review_count)

    Shapiro-Wilk normality test

data:  df$review_count
W = 0.6379, p-value < 2.2e-16
# Compute Pearson and Spearman correlation
cor_pearson <- cor.test(df$log_order_quantity, df$review_count, method = "pearson")
cor_spearman <- cor.test(df$log_order_quantity, df$review_count, method = "spearman")

# Perform Linear Regression
lm_model <- lm(log_order_quantity ~ review_count, data = df)
summary(lm_model)

Call:
lm(formula = log_order_quantity ~ review_count, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.9987 -0.5666 -0.2721 -0.1346  8.0256 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.7342673  0.0257913   67.24   <2e-16 ***
review_count 0.0098171  0.0005982   16.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.392 on 4485 degrees of freedom
Multiple R-squared:  0.05664,   Adjusted R-squared:  0.05643 
F-statistic: 269.3 on 1 and 4485 DF,  p-value: < 2.2e-16
# Extract regression coefficients
intercept <- coef(lm_model)[1]
slope <- coef(lm_model)[2]
eq_label <- paste0("y = ", round(slope, 4), "x + ", round(intercept, 4),
                   "\nPearson r = ", round(cor_pearson$estimate, 4))

# Scatter plot with regression line and equation
ggplot(df, aes(x = review_count, y = log_order_quantity)) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_smooth(method = "lm", color = "red") +
  annotate("text", x = max(df$review_count) * 0.7, y = max(df$log_order_quantity) * 0.9,
           label = eq_label, color = "black", size = 5, hjust = 0) +
  labs(title = "Review Count vs. Log Order Quantity",
       x = "Review Count",
       y = "Log(Order Quantity)") +
  theme_minimal()

Correlation Analysis: Review Count & Log Order Quantity

  • The Shapiro-Wilk test confirms that both log-transformed order quantity and review count are not normally distributed (p < 0.0001), making Spearman correlation the more reliable measure.
  • Pearson (r = 0.238, p < 0.0001) and Spearman (r = 0.121, p < 0.0001) correlations indicate a weak but statistically significant positive relationship between review count and order quantity.
  • The regression model (y = 0.0098x + 1.7343) suggests that each additional review slightly increases log order quantity.
  • A low R² value indicates that review count alone is not a strong predictor of sales.

Conclusion - More reviews slightly contribute to higher order volumes, but their impact is minimal. - Other factors like pricing, supplier experience, and product reputation play a much larger role in sales. - Suppliers should encourage reviews but also focus on pricing, product differentiation, and marketing strategies for better sales outcomes.

9 Integration with Python

9.1 Demonstrating Communication Between R and Python Using Quarto

Quarto allows us to execute both R and Python in the same document, enabling seamless communication between the two languages.

# Load reticulate package
library(reticulate)

# Show Python configuration
py_config()
python:         C:/Users/Srivatsava CK/AppData/Local/Programs/Python/Python313/python.exe
libpython:      C:/Users/Srivatsava CK/AppData/Local/Programs/Python/Python313/python313.dll
pythonhome:     C:/Users/Srivatsava CK/AppData/Local/Programs/Python/Python313
version:        3.13.1 (tags/v3.13.1:0671451, Dec  3 2024, 19:06:28) [MSC v.1942 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:/Users/Srivatsava CK/AppData/Local/Programs/Python/Python313/Lib/site-packages/numpy
numpy_version:  2.2.1

NOTE: Python version was forced by RETICULATE_PYTHON_FALLBACK

9.2 Which supplier countries have the highest total and average order quantities for each verification status?

import pandas as pd
df_py = r.df

# Group by supplier country and verification status
order_quantity_by_country = df_py.groupby(["supplier_country", "verification_status"]).agg(
    total_orders = ("order_quantity", "sum"),
    avg_orders = ("order_quantity", "mean")
).reset_index()

# Display result
print(order_quantity_by_country.head())
  supplier_country verification_status  total_orders  avg_orders
0               CN        Not Verified        261594   87.401938
1               CN            Verified        199829  143.041518
2               HK        Not Verified            28    4.000000
3               IN        Not Verified           248    4.000000
4               JP        Not Verified             4    4.000000

Insights:

  • China (CN) dominates both total and average order quantities, regardless of verification status.
  • Verified suppliers tend to receive higher average order quantities, showing the importance of trust in supplier selection.
  • Smaller markets like HK, IN, and JP have significantly lower total orders, indicating a regional preference in sourcing.

9.3 How do supplier experience levels impact pricing strategies across countries?

# Group by supplier country and experience category
pricing_by_experience = df_py.groupby(["supplier_country", "experience_category"]).agg(
    avg_min_price=("min_price", "mean"),
    avg_max_price=("max_price", "mean"),
    avg_price=("avg_price", "mean")
).reset_index()

# Display result
print(pricing_by_experience.head())
  supplier_country experience_category  avg_min_price  avg_max_price  avg_price
0               CN         Established      31.878378      58.413467  45.145922
1               CN             Growing      19.546012      38.339181  28.942596
2               CN                 New      19.487013      31.230396  25.358704
3               HK         Established       1.234000       1.850000   1.542000
4               HK             Growing       1.250000       1.380000   1.315000

Insights:

  • Established suppliers in China set the highest average prices, while newer suppliers adopt lower pricing strategies to compete.
  • Suppliers in Hong Kong have much lower price ranges, likely indicating a different market positioning or cost structure.
  • Pricing varies based on supplier experience, with experienced suppliers likely charging a premium for reputation and reliability.

9.4 What is the distribution of order quantities across different price ranges?

import matplotlib.pyplot as plt
import seaborn as sns

# Categorize avg_price into Low, Medium, High price ranges
q1, q2 = df_py["avg_price"].quantile([0.33, 0.66])

def categorize_price(price):
    if price <= q1:
        return "Low"
    elif price <= q2:
        return "Medium"
    else:
        return "High"

df_py["price_range"] = df_py["avg_price"].apply(categorize_price)

# Aggregate order quantity by price range
order_quantity_by_price = df_py.groupby("price_range")["order_quantity"].agg(["mean", "sum", "count"]).reset_index()

# Visualization
plt.figure(figsize=(8, 5))
sns.barplot(x="price_range", y="mean", data=order_quantity_by_price, palette="Blues")
plt.title("Average Order Quantity Across Price Ranges")
plt.xlabel("Price Range")
plt.ylabel("Average Order Quantity")
plt.show()

📌 Insights:

  • Low and Medium price ranges receive the highest order volumes, while high-priced products see lower demand.
  • Bulk purchasing behavior favors affordability, reinforcing the trend of customers preferring cost-effective solutions.
  • Suppliers targeting high sales volumes should optimize pricing for the lower segments.

10 SQL Integration

library(DBI)
library(RSQLite)

Connect to SQLite Database

db_file <- "artificial_flower_db.sqlite"
conn <- dbConnect(SQLite(), db_file)

Store Data in SQLite

# Remove existing table if it exists
dbExecute(conn, "DROP TABLE IF EXISTS flower_data")
[1] 0
# Create new table and write data
dbWriteTable(conn, "flower_data", df, overwrite = TRUE, row.names = FALSE)

Retrieve Data for Analysis

# Fetch all data
query_all <- dbGetQuery(conn, "SELECT * FROM flower_data LIMIT 5")
print(query_all)
                                                                                             products_name
1                                                   Decorative LED Tree Flower Lights cherry blossom trees
2                                                       Cheap artificial flowers New Garden cherry blossom
3                                                 ADULT DIY MINI CONCRETE FLOWER POT FOR LIVINGROOM GARDEN
4 Real Touch Calla Lily Decoration Flower Bridal Bouquet Flower Arrangements Flore Decorations 9pcs/bundle
5    Artificial flower Unicorn PE foam rose bear to send girlfriend birthday Valentine gift eternal flower
  order_quantity                                   supplier_name
1              4             zhongshan sun neon lighting factory
2             12     tianjin wuqing meiyu craft flower co., ltd.
3              4                             ningbo sw co., ltd.
4              1                 shenzhen oscare trade co., ltd.
5              4 beijing jiaxunda technology and trade co., ltd.
  verification_status years_in_business min_price max_price min_order_quantity
1            Verified                13     99.00    399.00                 10
2            Verified                 2      0.80      1.15                  2
3            Verified                 5      2.15      3.15                500
4            Verified                 5      2.56      3.06                144
5        Not Verified                 1      5.20      6.00                 10
  supplier_rating review_count supplier_country price_spread avg_price
1             5.0            1               CN       300.00   249.000
2             4.8           82               CN         0.35     0.975
3             5.0            2               CN         1.00     2.650
4             4.8            5               CN         0.50     2.810
5             5.0            2               CN         0.80     5.600
  experience_category log_order_quantity
1         Established          1.6094379
2                 New          2.5649494
3             Growing          1.6094379
4             Growing          0.6931472
5                 New          1.6094379

10.1 top suppliers with highest average order quantity?

# Fetch top suppliers with highest average order quantity
query_top_suppliers <- dbGetQuery(conn, "
    SELECT supplier_name, 
           AVG(CAST(REPLACE(order_quantity, ' orders', '') AS FLOAT)) AS avg_order_quantity
    FROM flower_data
    WHERE order_quantity IS NOT NULL AND order_quantity != ''
    GROUP BY supplier_name
    ORDER BY avg_order_quantity DESC
    LIMIT 5
")
print(query_top_suppliers)
                                     supplier_name avg_order_quantity
1     henan huaxin import & export trade co., ltd.           7000.000
2                yunnan rongsheng flower co., ltd.           2268.500
3 baoding junhou import and export trade co., ltd.           2045.833
4   fujian outstanding import and export co., ltd.           1705.304
5      yiwu junnuo electronic technology co., ltd.           1500.000

Insights:

  • Henan Huaxin Import & Export Trade Co., Ltd. leads with 7000 orders.
  • Significant gap between the 1st and 2nd supplier.
  • The top 3 suppliers dominate the market.
  • Business Implication: Competitors should analyze top suppliers’ strategies.

10.2 How do average, minimum, and maximum prices vary across different supplier countries?

# Fetch price trends by country
query_price_trends <- dbGetQuery(conn, "
    SELECT supplier_country, 
           avg(avg_price) avgprice, 
           min(min_price) minprice, 
           max(max_price) maxprice
    FROM flower_data
    GROUP BY supplier_country
    ORDER BY avg_price DESC
")
print(query_price_trends)
  supplier_country  avgprice minprice maxprice
1               CN 29.037720     0.01  4500.00
2               IN 22.662698     0.10  1000.00
3               MY 70.375000     0.55   200.00
4               TR 17.850000     5.40    50.00
5               TH  5.981250     0.20    13.00
6               JP  7.345000     6.40     8.29
7               TW  2.857308     0.90     7.32
8               VN  2.860000     1.86     3.86
9               HK  1.426429     0.86     3.62

Insights from Price Trends by Country

  • China (CN) has the highest average price (29.04) but also the widest price range (0.01 to 4500.00), indicating a mix of low-cost and premium-priced suppliers.
  • India (IN) follows with an average price of 22.66, with prices ranging from 0.10 to 1000.00, showing a diverse pricing structure.
  • Malaysia (MY) has the highest average price (70.38), but its maximum price (200.00) is much lower than China’s, suggesting a more standardized pricing model.
  • Turkey (TR), Thailand (TH), and Japan (JP) have relatively lower max prices, indicating a more controlled pricing strategy.
  • Vietnam (VN), Taiwan (TW), and Hong Kong (HK) have the lowest average prices, suggesting these markets may focus more on budget-friendly or low-cost products.
  • Business Implication: Suppliers from China and India offer a wide range of prices, making them suitable for both budget and high-end buyers. Malaysia has a consistently high average price, indicating a focus on premium products. Buyers looking for cost-effective sourcing may consider Vietnam, Taiwan, or Hong Kong for lower-cost alternatives.
# --- 4. Close Database Connection ---
dbDisconnect(conn)

11 Business Insights and Recommendations

11.1 Business Insights

  1. Order Quantity Trends
    • Order quantities are highly skewed, with a few suppliers receiving bulk orders while most get low to moderate sales.
    • Extreme outliers exist, suggesting some suppliers dominate the market.
  2. Price Analysis
    • Wide variation in pricing strategies among suppliers, with some using low minimum prices to attract customers while others maintain high price spreads.
    • No strong correlation between price and order quantity, indicating that customers do not always buy based on the lowest price.
  3. Supplier Verification Impact
    • Verified suppliers receive significantly higher order volumes than non-verified ones, confirming buyer trust in verification status.
  4. Supplier Experience & Order Quantity
    • Experienced suppliers receive higher order quantities, while new suppliers struggle to gain traction.
    • No major difference between Growing and Established suppliers, indicating order volume stabilizes after 3-7 years in business.
  5. Country-Wise Supplier Performance
    • China (CN) dominates in both total and average order volumes, while other countries show significantly lower averages.
    • Suppliers from other regions may cater to niche markets or higher-priced segments.
  6. Supplier Rating & Review Count Impact
    • Review count is positively correlated with order quantity, while rating alone does not drive more sales.
    • Suppliers with more reviews tend to have higher credibility.
  7. Pricing Strategies of High-Volume Suppliers
    • Mid-range pricing is the most effective, as suppliers with extreme low or high prices do not have the highest order volumes.
  8. Minimum Order Quantity (MOQ) Trends
    • Suppliers with lower MOQs attract more buyers, while high MOQs limit flexibility and discourage orders.
  9. Price Variation Among Suppliers
    • Some suppliers have extreme price ranges, which could indicate bulk pricing discounts or inconsistent pricing strategies.
    • Suppliers with stable pricing attract more consistent buyers.
  10. Review Count and Supplier Longevity
  • Older suppliers have higher review counts, confirming credibility builds over time**.
  • Some new suppliers get high reviews quickly, suggesting strong early-stage marketing or aggressive promotion strategies**.

11.2 Business Recommendations

  1. Encourage Supplier Verification
    • Since verified suppliers receive more orders, I recommend incentivizing non-verified suppliers to get verified through promotional benefits.
  2. Promote Bulk Purchasing
    • Suppliers should offer volume-based discounts to increase order sizes.
  3. Optimize Pricing Strategies
    • Avoid extreme pricing (too low or too high)mid-range pricing attracts the most sales.
    • Analyze competitive pricing trends and adjust pricing strategies based on market demand.
  4. Encourage Customer Reviews
    • Suppliers should actively collect and showcase reviews to build credibility.
    • Providing incentives (discounts, loyalty programs) to buyers who leave reviews can increase engagement.
  5. Analyze Supplier Experience and Performance
    • New suppliers should use promotional strategies to gain traction.
    • Growing suppliers should focus on customer retention and branding to transition into Established status.
  6. Leverage Country-Based Supplier Trends
    • Bulk buyers should prioritize China-based suppliers, as they dominate in order volume.
    • For niche or premium markets, suppliers from other regions may offer unique advantages.
  7. Encourage Flexible Minimum Order Quantities (MOQs)
    • Lower MOQ requirements attract more buyers, so suppliers should offer flexible order options.
    • Bulk sellers should provide tiered pricing models for different order sizes.
  8. Standardize Pricing Strategies
    • Suppliers with large price variations should ensure clear bulk pricing tiers to avoid confusion.
  9. Strategic Marketing for New Suppliers
    • New suppliers should focus on aggressive marketing and promotions to quickly build credibility and reviews.
    • Older suppliers should leverage their existing reputation to maintain competitive advantage.

12 Conclusion

Through my data-driven analysis, I have identified key insights into supplier performance, pricing strategies, verification impact, and order quantity trends.

  • Verified and experienced suppliers receive more orders, reinforcing the importance of credibility and trust-building.
  • Price alone does not determine order volume—other factors like review count, supplier experience, and verification status play crucial roles.
  • Bulk orders are concentrated among a few top suppliers, emphasizing the need for competitive pricing and promotions.
  • China dominates supplier order volumes, but other countries cater to niche markets.

12.1 Strategic Actions:

  • Encourage verification for increased sales
  • Adjust pricing to mid-range levels for better market positioning
  • Use review collection strategies to enhance trust
  • Offer bulk pricing and flexible MOQs to attract more buyers

By implementing these recommendations, I believe businesses can optimize supplier selection, pricing models, and customer engagement strategies to maximize success in the artificial flower e-commerce market.

References

Fox, John, and Sanford Weisberg. 2019. “An r Companion to Applied Regression.” https://www.john-fox.ca/Companion/.
Iannone, Richard, Joe Cheng, Barret Schloerke, Ellis Hughes, Alexandra Lauer, JooYoung Seo, Ken Brevoort, and Olivier Roy. 2024. “Gt: Easily Create Presentation-Ready Display Tables.” https://CRAN.R-project.org/package=gt.
Müller, Kirill, Hadley Wickham, David A. James, and Seth Falcon. 2024. “RSQLite: SQLite Interface for r.” https://CRAN.R-project.org/package=RSQLite.
R Special Interest Group on Databases (R-SIG-DB), Hadley Wickham, and Kirill Müller. 2024. “DBI: R Database Interface.” https://CRAN.R-project.org/package=DBI.
Sievert, Carson. 2020. “Interactive Web-Based Data Visualization with r, Plotly, and Shiny.” https://plotly-r.com.
Ushey, Kevin, JJ Allaire, and Yuan Tang. 2025. “Reticulate: Interface to ’Python’.” https://CRAN.R-project.org/package=reticulate.
Wickham, Hadley. 2007. “Reshaping Data with the Reshape Package” 21. http://www.jstatsoft.org/v21/i12/.
———. 2016. “Ggplot2: Elegant Graphics for Data Analysis.” https://ggplot2.tidyverse.org.
———. 2023. “Stringr: Simple, Consistent Wrappers for Common String Operations.” https://CRAN.R-project.org/package=stringr.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse 4: 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. “Dplyr: A Grammar of Data Manipulation.” https://CRAN.R-project.org/package=dplyr.